Thanks for uploading this! It helps a loooot!
Thank you so much for your efforts.
very useful !! thanks for sharing c:
I am glad it helps!
Thanks a lot for this amazing tutorial! It helps me a lot.
Thank you very much. It is really helpful for my project in the university! And it would be really helpful if you prepare a video about application of descriptors in QSAR, machine learning….
Please!
I am glad it helps. Thanks for your suggestion.
@@Dr.Gashaw.M_Goshu Dear Sir, I am new one. How can I run my data on your link? Thanks so much.
Very interesting, Thanks.
You are welcome!
Thank you so much, sir. Can you teach me about protein descriptors too?
Thanks for this tutorial Dr. Gashaw, could you give a line of code that could list the 1826 Mordred Descriptors just like you listed the RDkit descriptors during the video. I tried to do the same for Mordred, but it didn't work.
Please try the following:
print(Calculator(descriptors, ignore_3D=False).descriptors). It should print all the 1826 descriptors.
Please take a look at the following reference for further information github.com/mordred-descriptor/mordred/blob/develop/examples/020-single_mol-multiple_desc.py
Amazing video! About the FingerPrints, every time they are generated, does column 1 always correspond to the same chemical fragment?
Very good question! In these chemical series, every time I run the notebook only three rows have on bits in the first column, and the rest have zero values.
Morgan_fingerprints.Col_0.value_counts()
0 2870
1 3
You can rerun it. You will probably get the same result. My guess is that the column encodes the same chemical fragment every time, otherwise I would get different values on each run, but I am not sure which fragment is encoded in the first column. You can take a look at the following RDKit blog post to learn more about bit rendering. rdkit.blogspot.com/2018/10/using-new-fingerprint-bit-rendering-code.html
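To check which fragment sits behind a given column, the bit information that RDKit records while building the fingerprint can be inspected. The following is only a sketch: it assumes the older AllChem.GetMorganFingerprintAsBitVect call (newer RDKit releases mark it as deprecated), a radius of 2 and 2048 bits, which may differ from the notebook, and the function name morgan_bits_with_fragments is made up for illustration.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def morgan_bits_with_fragments(smiles, radius=2, n_bits=2048):
    """Return the on bits of a Morgan fingerprint and one fragment per bit.

    Hypothetical helper: radius and n_bits are assumed values.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")

    bit_info = {}  # filled by RDKit: bit -> ((atom_idx, radius), ...)
    fingerprint = AllChem.GetMorganFingerprintAsBitVect(
        mol, radius, nBits=n_bits, bitInfo=bit_info
    )

    fragments = {}
    for bit, environments in bit_info.items():
        atom_idx, env_radius = environments[0]  # first environment that set the bit
        bond_ids = []
        if env_radius > 0:
            bond_ids = list(Chem.FindAtomEnvironmentOfRadiusN(mol, env_radius, atom_idx))
        if bond_ids:
            # Cut the environment out of the molecule and write it as SMILES
            fragments[bit] = Chem.MolToSmiles(Chem.PathToSubmol(mol, bond_ids))
        else:
            # Radius-0 environment: just the central atom
            fragments[bit] = mol.GetAtomWithIdx(atom_idx).GetSymbol()
    return list(fingerprint.GetOnBits()), fragments


on_bits, fragments = morgan_bits_with_fragments("CCO")
on_bits_again, _ = morgan_bits_with_fragments("CCO")
for bit in on_bits:
    print(bit, fragments[bit])
```

The bit index comes from hashing the atom environment, so the same environment lands in the same column on every run. Different environments can still collide on one bit when the fingerprint is folded to a fixed length.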
Thank you! Amazing tutorial. How can I download the Mordred data frame to an Excel file? I have tried, but several errors are shown because the type of mordred_descriptors is MordredDataFrame, not a "normal" DataFrame. I mean, it has different methods and it is not possible to create the file.
If you have a Mordred data frame in a Jupyter notebook like mine, you can save it as a CSV file like this:
mordred_descriptors.to_csv('mordred_descriptors.csv',index=False)
Then, you can open it in Excel without any problem. Once you have opened it in Excel, you can save it as an Excel file. I hope this helps!
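If the export itself fails, one thing worth trying is to convert the table to a plain pandas DataFrame with numeric cells first. The sketch below uses a small stand-in table instead of a real Mordred result, and the helper name to_plain_frame is made up for illustration. It assumes that cells which are not numbers (for example descriptors that could not be calculated) can be turned into NaN.

```python
import os
import tempfile

import pandas as pd


def to_plain_frame(descriptor_frame):
    """Copy a descriptor table into a plain DataFrame with numeric cells.

    Hypothetical helper: anything that is not a number becomes NaN.
    """
    plain = pd.DataFrame(descriptor_frame).copy()
    return plain.apply(pd.to_numeric, errors="coerce")


# Stand-in for mordred_descriptors; the second SLogP cell mimics a failed descriptor
demo = pd.DataFrame({"MW": [180.16, 46.07], "SLogP": [1.31, "missing 3D coordinate"]})
plain = to_plain_frame(demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "mordred_descriptors.csv")
    plain.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)
    # Direct Excel export; pandas needs an Excel writer such as openpyxl for this
    # plain.to_excel(os.path.join(tmp_dir, "mordred_descriptors.xlsx"), index=False)

print(reloaded)
```

With the real data, to_plain_frame(mordred_descriptors) would take the place of the stand-in table.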
In the canonical SMILES part, the number of compounds (2904) is the same before and after running the function. Does that mean there aren't any canonical SMILES?
The number of SMILES before and after running the function is the same and the function does not remove duplicate SMILES. That is why we need to remove duplicate structures after running the function.
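The two steps can be sketched on a toy table as follows. This is a minimal example, not the notebook code: the column name SMILES follows the notebook, while the four example molecules and the helper name canonical_smiles are chosen only for illustration.

```python
import pandas as pd
from rdkit import Chem


def canonical_smiles(smiles_list):
    """Convert each SMILES to RDKit's canonical form (None if it cannot be parsed)."""
    mols = [Chem.MolFromSmiles(smi) for smi in smiles_list]
    return [Chem.MolToSmiles(mol) if mol is not None else None for mol in mols]


# Three different ways of writing ethanol, plus benzene
demo = pd.DataFrame({"SMILES": ["OCC", "CCO", "C(C)O", "c1ccccc1"]})

# Step 1: canonicalize. The number of rows does not change.
demo["SMILES"] = canonical_smiles(demo["SMILES"])
rows_after_canonicalization = len(demo)

# Step 2: only now do the duplicates become identical strings that can be dropped.
deduplicated = demo.drop_duplicates(subset="SMILES").reset_index(drop=True)

print(rows_after_canonicalization, len(deduplicated))
```

Resetting the index after dropping rows keeps the table aligned with anything that is concatenated to it later.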
Dear Sir, I am new one. How can I run my data on your link? Thanks so much.
No problem. It is prepared for beginners. Use the following YouTube video to get a sense of how Google Colab works: ua-cam.com/video/6Xt6L1I5jScI/v-deo.htmlt
After that, you can open the notebook in Google Colab, save it in your account, and then replace my data with yours.
@@Dr.Gashaw.M_Goshu thanks so much for your replies. Dear Sir, I only have SMILES, Is it possible to calculate the energy gap from SMILES?
@@Dr.Gashaw.M_Goshu Is there any file to explain 200 descriptors, fingerprint, and 1826 descriptors? thanks
Good question! Here is the publication for Mordred, which is free to read, but I do not remember if they put detailed information about each descriptor. jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0258-y#abbreviations
Thank you so much for this video, it's really helpful. A quick question: can I choose which descriptors I want to calculate from Mordred instead of calculating all 1826? After feature selection I know which ones I want to include in my model, and therefore which ones I need when testing new inputs. This would save so much time.
Great video Sir! I am a beginner in software but very interested and I just didn’t fully understand how I plug in my SMILES strings in the program. Could somebody explain it? Thanks in advance 🙏🏻
As long as you are interested, learning programming, or anything for that matter, is much easier than it was 10 years ago. Basic Python is needed to understand my videos. You can learn it from W3Schools www.w3schools.com/python/default.asp or from freeCodeCamp's YouTube videos ua-cam.com/video/eWRfhZUzrAc/v-deo.html and many others.
A good demonstration, especially of the coding procedure. Some pronunciation was a little difficult to understand, for instance "Morgan fingerprint", so it was useful to have captioning available.
Thank you for sharing this video.
I keep getting this error on the RDKit features:
XGBoostError: [01:58:25] ../src/data/data.cc:1104: Check failed: valid: Input data contains `inf` or `nan`.
I have used dropna() but I still can't get rid of the error. Also when I try to see if there are any nan or infinite values it shows 0.
If some rdkit descriptors have infinity or nan values for your compounds, you need to identify such columns and drop them before feeding the molecular descriptors to the XGBoost algorithm.
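Try a = np.where(np.isnan(a) | np.isinf(a), 0, a)
This line replaces all nan, inf, and -inf values with 0. Note that np.where needs the third argument (the original array); with only the condition and 0 it raises an error.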
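Dropping the affected columns, as suggested above, can be sketched as follows. One possible reason why dropna() and a nan/inf check find nothing is that a descriptor value is finite but too large for 32-bit floats, which XGBoost uses internally as far as I know, so treat that part as an assumption. The helper name drop_non_finite_columns and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd


def drop_non_finite_columns(descriptor_frame):
    """Drop every column that contains NaN, +/-inf, or a value beyond float32 range.

    Hypothetical helper; the float32 limit is an assumption about the model input.
    """
    numeric = descriptor_frame.apply(pd.to_numeric, errors="coerce")
    values = numeric.to_numpy(dtype=np.float64)
    float32_max = np.finfo(np.float32).max
    bad = ~np.isfinite(values) | (np.abs(values) > float32_max)
    keep = ~bad.any(axis=0)  # True for columns without any bad cell
    return numeric.loc[:, keep]


# Toy descriptor table: only column "a" is safe to pass on
demo = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0],
        "b": [1.0, np.inf, 3.0],   # infinite value, not removed by dropna()
        "c": [1.0, np.nan, 3.0],   # missing value
        "d": [1.0, 1e300, 3.0],    # finite, but overflows float32
    }
)
cleaned = drop_non_finite_columns(demo)
print(list(cleaned.columns))
```

Dropping columns keeps all compounds. Replacing values with 0, as in the one-liner above, keeps all columns but changes the data, so which one is better depends on the dataset.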
How do I get chemical compounds into colab server from a file in sdf format?
Hi Ahmed,
If your file is not on GitHub, you need to upload it to your Google Drive, mount the drive, and work from there. Please take a look at the following video starting from 9:00 min. He describes it clearly. ua-cam.com/video/6Xt6L1I5jSc/v-deo.html
I hope this helps.
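Once the file is reachable from the notebook, reading it could look like the sketch below. The helper name load_sdf_as_smiles is made up, the Drive path in the comments is only a placeholder, and the demo writes a tiny SDF file to a temporary folder so that the example can run without Google Drive.

```python
import os
import tempfile

from rdkit import Chem

# In Colab, the drive is mounted first (assumed folder layout, adjust to yours):
# from google.colab import drive
# drive.mount('/content/drive')
# sdf_path = '/content/drive/MyDrive/my_compounds.sdf'  # placeholder path


def load_sdf_as_smiles(sdf_path):
    """Read an SDF file and return a SMILES string for every readable record."""
    supplier = Chem.SDMolSupplier(sdf_path)
    # Records that RDKit cannot parse come back as None and are skipped
    return [Chem.MolToSmiles(mol) for mol in supplier if mol is not None]


# Demo: write two molecules to a temporary SDF file, then read them back
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.sdf")
    writer = Chem.SDWriter(demo_path)
    for smi in ["CCO", "c1ccccc1"]:
        writer.write(Chem.MolFromSmiles(smi))
    writer.close()
    demo_smiles = load_sdf_as_smiles(demo_path)

print(demo_smiles)
```

The resulting list can be put into a data frame column and passed to the descriptor functions in the same way as the SMILES in the notebook.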
Thank you Sir for your valuable knowledge. I followed the tutorial but I was not getting the molecular descriptors for Mordred properly. For some descriptors it was showing 3D missing. What should I do? Do I need to optimize the geometry? I ask because I use the canonical SMILES from PubChem for my molecules. Thanks
Try ignoring 3D descriptors as shown below:
calc = Calculator(descriptors, ignore_3D=True)
Thank you so much for your great video. Could you please make a video about 4D descriptors? Could you please name some software that I can use to calculate 4D descriptors?
When I was in graduate school, I used commercial software called SYBYL-X to calculate CoMFA. Now, the software is not available. I am not sure if there is free and reliable software that can calculate 4D descriptors. 4D descriptors are computationally expensive, so that may be the main reason that we do not have many options. I am not sure how good they are, but take a look at the following links. www.ra.cs.uni-tuebingen.de/software/4DFAP/
github.com/rougeth/Web-4D-QSAR
Prof, what is the code to save the results of rdkit and mordred?
Good question! They are in a pandas DataFrame. We can save them as a CSV file. For example, in the notebook, the Mordred descriptors were stored in a variable named mordred_descriptors. It can be saved as follows.
mordred_descriptors.to_csv('mordred_descriptors.csv', index = False)
This should save your Mordred descriptors in your current working directory.
@@Dr.Gashaw.M_Goshu thank you prof
Nice
Sir, can you explain more about canonical SMILES? What is the difference between SMILES and canonical SMILES? Thank you Sir
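A molecule can be represented by more than one valid SMILES string, but a given canonicalization algorithm produces only one canonical SMILES for it. For example, OCC and C(C)O both describe ethanol, and RDKit writes both as CCO. That is the basic difference.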
@@Dr.Gashaw.M_Goshu I understand, thank you so much Sir
@@Dr.Gashaw.M_Goshu Prof, when I calculated the descriptors using Mordred, I got 2 values for each feature. I need each canonical SMILES to have one value per feature. May I delete one of them, Prof? And why does this happen, Prof?🙏
I am not sure why you are getting duplicate values. If you follow the steps in the notebook, you should not get duplicate SMILES. I would suggest removing duplicate SMILES before calculating the molecular descriptors.
@@Dr.Gashaw.M_Goshu I mean, in the previous calculation with RDKit my dataset had 270 canonical SMILES. But after calculating the RDKit descriptors, the result had more than 270 rows. Is that okay, Prof?
Here the molecular descriptors are calculated after the hydrogens are made explicit, so the descriptors are found with hydrogens. Can you confirm?
Yes, hydrogen atoms are added to the structures, and then the molecular descriptors are calculated for each compound.
@@Dr.Gashaw.M_Goshu But for some 2D indices we usually use the hydrogen-depleted structure. Will it remove the hydrogens by itself when calculating those?
@@Dr.Gashaw.M_Goshu can you please explain why hydrogen atoms are added to the structures and then, molecular descriptors were calculated for each compound?
@@AGnanaprakasam Without that step the hydrogens are still included in the structure, only implicitly. What I did was make them explicit. We can try with and without adding hydrogens and then take the one that gives better results. In my case, I got slightly better results when the hydrogens were explicit. That is why I added hydrogens.
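A small check of what "explicit" means, using ethanol as an assumed example and RDKit's Chem.AddHs. This is only an illustration of the idea, not the notebook code.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Ethanol parsed from SMILES: hydrogens are implicit (stored as counts on the heavy atoms)
mol = Chem.MolFromSmiles("CCO")
atoms_implicit = mol.GetNumAtoms()        # heavy atoms only

# The same molecule with hydrogens added as real atoms in the graph
mol_h = Chem.AddHs(mol)
atoms_explicit = mol_h.GetNumAtoms()      # heavy atoms plus hydrogens

# Molecular weight counts the hydrogens in both cases
weight_implicit = Descriptors.MolWt(mol)
weight_explicit = Descriptors.MolWt(mol_h)

print(atoms_implicit, atoms_explicit)
print(round(weight_implicit, 3), round(weight_explicit, 3))
```

Descriptors such as the molecular weight do not change, while descriptors that walk over the atoms and bonds of the graph can change once the hydrogens are explicit. That is why it is worth comparing both variants.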
@@Dr.Gashaw.M_Goshu Thank you
Thank you. In 2024, the functions to calculate RDKit and Mordred descriptors give an error.
I ran it in Colab a few minutes ago and did not see any issues. I am not sure why it did not work for you.
@@Dr.Gashaw.M_Goshu You are right. I just ran it in Colab and I didn't see any issues either. When I installed the modules and Python 3.9 on my local computer and ran the cells, it gave errors (maybe due to changes in the packages).
OK, I rewrote the first function, now it worked:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors

# get descriptor table
def calculate_descriptors(smiles_list):
    # List of descriptor names
    descriptor_names = [desc_name[0] for desc_name in Descriptors._descList]
    # Create a MoleculeDescriptors calculator
    calculator = MoleculeDescriptors.MolecularDescriptorCalculator(descriptor_names)
    # Initialize a list to hold the descriptors for each molecule
    descriptors_list = []
    for smiles in smiles_list:
        # Convert SMILES to RDKit molecule
        mol = Chem.MolFromSmiles(smiles)
        if mol is not None:
            # Calculate descriptors
            descriptors = calculator.CalcDescriptors(mol)
            descriptors_list.append(descriptors)
        else:
            # Invalid SMILES: append a row of NaN (instead of None) so that
            # the DataFrame below still has one value per descriptor column
            descriptors_list.append((float('nan'),) * len(descriptor_names))
    return descriptor_names, descriptors_list

# function usage:
smiles_list = dataset_new['SMILES']
descriptor_names, descriptors = calculate_descriptors(smiles_list)
df_with_200_descriptors = pd.DataFrame(descriptors, columns=descriptor_names)