String handling in pandas¶
String handling methods are accessed through the .str accessor on a Series (column).
Example 1:
df_client["Subject"].str.rsplit("-", n=1, expand=True)[1]
This splits the Subject column on the last occurrence of the "-" character, returns a DataFrame of the split parts (expand=True), and then selects column 1, the part after the separator. (Newer pandas versions require n= to be passed as a keyword argument.)
Example 2:
df_client["Subject"].str.strip()
This will remove any surrounding whitespace from the elements of the Subject column.
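A minimal runnable sketch of both examples, using invented data (the real Subject values are not shown in these notes):

```python
import pandas as pd

df_client = pd.DataFrame({"Subject": ["ABC-123-Invoice ", " DEF-456-Receipt"]})

# Split on the last "-" only (n=1); expand=True returns a DataFrame of parts
last_part = df_client["Subject"].str.rsplit("-", n=1, expand=True)[1]
print(last_part.tolist())  # ['Invoice ', 'Receipt']

# Strip surrounding whitespace from each element
print(df_client["Subject"].str.strip().tolist())  # ['ABC-123-Invoice', 'DEF-456-Receipt']
```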
Methods for selecting the last character¶
RPartition
df_egencia["Subject"].str.rpartition("-")[2].str.strip()
rpartition splits on the last occurrence of the separator and returns a three-column DataFrame (similar to rsplit with expand=True): the text before the separator, the separator itself, and the text after it. Column [2] is the split-off item.
RSplit with expand=True
df_egencia["Subject"].str.rsplit("-", n=1, expand=True)[1]
expand=True returns a DataFrame, with one column per split part.
RSplit with str
index
df_client["Subject"].str.rsplit("-", n=1).str[-1]
Without expand, rsplit returns a Series of lists. The .str accessor allows element-wise indexing on those lists, so [-1] returns the last item of each, as with an ordinary list.
See indexing-with-str
List comprehension
df_client['last_col'] = [x.rsplit("-", 1)[-1] for x in df_client["Subject"]]
Note: assigning the list comprehension's result to df_client['last_col'] adds it to the dataframe as a new column.
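All four approaches above can be compared on a toy column (the data below is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Subject": ["AMP6-Water Networks", "AMP6-Wastewater"]})

a = df["Subject"].str.rpartition("-")[2].str.strip()          # rpartition
b = df["Subject"].str.rsplit("-", n=1, expand=True)[1]        # rsplit + expand
c = df["Subject"].str.rsplit("-", n=1).str[-1]                # rsplit + .str index
d = pd.Series([x.rsplit("-", 1)[-1] for x in df["Subject"]])  # list comprehension

# All four give the last "-"-separated segment
print(a.tolist())  # ['Water Networks', 'Wastewater']
```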
Column handling¶
Rename all columns
df.columns = ['a', 'b']
Rename one or more columns
df = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})
# Or rename the existing DataFrame (rather than creating a copy)
df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}, inplace=True)
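A small sketch of both renaming styles on a throwaway frame:

```python
import pandas as pd

df = pd.DataFrame({'oldName1': [1], 'oldName2': [2]})

# Rename all columns by assignment
df.columns = ['a', 'b']

# Rename selected columns via a mapping (returns a new DataFrame)
df = df.rename(columns={'a': 'newName1'})
print(list(df.columns))  # ['newName1', 'b']
```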
Filtering¶
Filtering more than one column or predicate¶
Boolean AND, OR, and NOT can be used to combine conditions, using the following symbols:
& AND
| OR
~ NOT
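A quick sketch of the three operators; note that each condition must be wrapped in parentheses (data invented):

```python
import pandas as pd

df = pd.DataFrame({'score': [10, 30, 60], 'group': ['x', 'y', 'x']})

both = df[(df['score'] > 20) & (df['group'] == 'x')]    # AND
either = df[(df['score'] > 50) | (df['group'] == 'y')]  # OR
negated = df[~(df['group'] == 'x')]                     # NOT
print(both['score'].tolist())  # [60]
```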
Filtering using string functions¶
str.contains
for partial (substring or regex) matches
str.match
for a regex match anchored at the start of the string (use str.fullmatch to require the whole string to match)
Examples:
df['col'].str.contains("Good ")
df['col'].str.match("Valid")
(.str applies to a Series, i.e. a single column, not the whole DataFrame.)
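A runnable comparison of the two (plus str.fullmatch, available from pandas 1.1) on invented data:

```python
import pandas as pd

s = pd.Series(["Good morning", "Very Good", "Valid", "Validated"])

contains = s[s.str.contains("Good")]  # substring anywhere
starts = s[s.str.match("Valid")]      # regex anchored at the start
exact = s[s.str.fullmatch("Valid")]   # whole string must match
print(starts.tolist())  # ['Valid', 'Validated']
```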
Filtering a DataFrame by more than 1 column¶
Note: axis refers to whether an action is row-based (0, default) or column-based (axis=1).
Use axis=1 so the lambda receives one row at a time:
df[df.apply(lambda row: 'chrome' in row['filename'] and row['my_time'] >= '2017-01-01', axis=1)]
https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html
d1[(d1['AMP'] == 'AMP6') & (d1['Portfolio'] == 'Water Networks')]
https://stackoverflow.com/questions/35186291/how-do-i-delete-rows-not-starting-with-x-in-pandas-or-keep-rows-starting-with
df = df[~df['c'].astype(str).str.startswith('1')]
where ~ means is not (negation)
https://stackoverflow.com/questions/28679930/how-to-drop-rows-from-pandas-data-frame-that-contains-a-particular-string-in-a-p
df[~df.C.str.contains("XYZ", na=False)]
The more idiomatic form of df[df.C.str.contains("XYZ") == False]; na=False makes NaN entries count as non-matches, so rows with NaN in C are kept rather than dropped.
Dropna (Pandas)¶
https://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-in-certain-columns-is-nan
https://stackoverflow.com/questions/36370839/better-way-to-drop-nan-rows-in-pandas
df.dropna()                # drop all rows that have any NaN values
df.dropna(how='all')       # drop only if ALL columns are NaN
df.dropna(thresh=2)        # drop rows with fewer than two non-NaN values
df.dropna(subset=[1])      # drop only if NaN in a specific column
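The four variants above, checked on a small invented frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'EPS': [1.0, np.nan, 3.0],
                   'Rev': [np.nan, np.nan, 30.0]})

print(len(df.dropna()))                # 1 -> only the fully populated row survives
print(len(df.dropna(how='all')))       # 2 -> drops only the all-NaN row
print(len(df.dropna(thresh=2)))        # 1 -> rows with at least 2 non-NaN values
print(len(df.dropna(subset=['EPS'])))  # 2 -> NaN allowed outside 'EPS'
```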
For reference - pandas.notnull , np.isfinite¶
df = df[pd.notnull(df['EPS'])]
df = df[np.isfinite(df['EPS'])]
Merge, Join, Concat (Pandas)¶
https://pandas.pydata.org/pandas-docs/stable/merging.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html
https://chrisalbon.com/python/pandas_join_merge_dataframe.html
https://stackoverflow.com/questions/41815079/pandas-merge-join-two-data-frames-on-multiple-columns
https://stackoverflow.com/questions/18792918/pandas-combining-2-data-frames-join-on-a-common-column
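As a minimal sketch of the merge behaviour those links discuss (column names invented):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'val_l': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'val_r': [3, 4]})

inner = pd.merge(left, right, on='key')               # intersection of keys
outer = pd.merge(left, right, on='key', how='outer')  # union of keys
print(inner['key'].tolist())  # ['b']
```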
Read from Excel (Pandas)¶
https://stackoverflow.com/questions/26521266/using-pandas-to-pd-read-excel-for-multiple-worksheets-of-the-same-workbook
https://stackoverflow.com/questions/24709108/python-pandas-how-could-i-read-excel-file-without-column-label-and-then-insert
df = pd.read_excel(file_in,sheetname=sheet_in, header=None, names=names, skiprows=73, parse_cols = "B,C,G:L")
names = ["PRN", "Projects", "Status", "AMP", "Portfolio", "Programme", "Type", "Delivery Route"]
sheet_in = 'Project Details (2)'
file_in = "PPM-Portfolio Performance Monitoring_dashboard New 2017_07_20.xlsx"
Old method
xl = pd.ExcelFile("Path + filename")
df = xl.parse("Sheet1")
xls_file = pd.ExcelFile('PPM-Portfolio Performance Monitoring_dashboard New 2017_07_20.xlsx')
xls_file.sheet_names
https://stackoverflow.com/questions/17335649/how-to-specify-column-names-while-reading-an-excel-file-using-pandas
https://stackoverflow.com/questions/33655127/how-to-read-certain-columns-from-excel-using-pandas-python
import pandas as pd
import numpy as np
file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols="A,C:AA")
print(df)
parse_cols : int or list, default None
If None, parse all columns.
If int, indicates the last column to be parsed.
If a list of ints, indicates the list of column numbers to be parsed.
If a string, indicates a comma-separated list of column names and column ranges (e.g. "A:E" or "A,C,E:F").
Note: in recent pandas versions parse_cols has been renamed to usecols, and sheetname to sheet_name.
Limiting number of rows (Pandas)¶
https://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.DataFrame.head.html
https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html
d1 = df.head(n=882)
(equivalent to SQL TOP)
https://stackoverflow.com/questions/23853553/python-pandas-how-to-read-only-first-n-rows-of-csv-files-in
If you only want to read the first 999,999 (non-header) rows:
read_csv(..., nrows=999999)
If you only want to read rows 1,000,000 ... 1,999,999
read_csv(..., skiprows=1000000, nrows=999999)
nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files.
skiprows : list-like or integer
Row numbers to skip (0-indexed) or number of rows to skip (int) at the start of the file.
and for large files, you'll probably also want to use chunksize:
chunksize : int, default None
Return a TextFileReader object for iteration.
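A self-contained sketch of nrows, skiprows, and chunksize, reading from an in-memory CSV instead of a real file:

```python
import io
import pandas as pd

csv_text = "a,b\n" + "\n".join(f"{i},{i * i}" for i in range(10))

first3 = pd.read_csv(io.StringIO(csv_text), nrows=3)
# skiprows counts from the top of the file, so skip data rows 1-3 but keep the header (row 0)
middle = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 4), nrows=3)
# chunksize yields DataFrames of at most 4 rows each
chunks = [len(c) for c in pd.read_csv(io.StringIO(csv_text), chunksize=4)]

print(first3['a'].tolist())  # [0, 1, 2]
print(middle['a'].tolist())  # [3, 4, 5]
print(chunks)                # [4, 4, 2]
```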
https://stackoverflow.com/questions/35747476/in-pandas-whats-the-equivalent-of-nrows-from-read-csv-to-be-used-in-read-ex
If you know the number of rows in your Excel sheet, you can use the skip_footer parameter to read the first n - skip_footer rows of your file, where n is the total number of rows.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
Usage:
data = pd.read_excel(filepath, header=0, parse_cols = "A:D", skip_footer=80)
Assuming your Excel sheet has 100 rows, this line would parse the first 20 rows.
To make the approach more flexible, assume that we don't know the total number of rows in the Excel sheet:
xl = pd.ExcelFile(filepath)
# parse the first (index 0) sheet to count its rows
total_rows = xl.book.sheet_by_index(0).nrows
skiprows = 4
nrows = 20
# calculate the number of footer rows; the extra -1 accounts for the header row
skip_footer = total_rows - nrows - skiprows - 1
df = xl.parse(0, skiprows=skiprows, skip_footer=skip_footer, parse_cols="A:D").dropna(axis=1, how='all')
.dropna(axis=1, how='all') drops all columns containing only NaNs.
Export to CSV (Pandas)¶
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
d1.to_csv('PPM.csv')
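A round-trip sketch using an in-memory buffer instead of a file on disk:

```python
import io
import pandas as pd

d1 = pd.DataFrame({'Portfolio': ['Water Networks'], 'Size': [3]})

buf = io.StringIO()
d1.to_csv(buf, index=False)  # index=False drops the row-number column
buf.seek(0)
round_trip = pd.read_csv(buf)
print(round_trip.equals(d1))  # True
```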
Grouping column¶
https://stackoverflow.com/questions/44577367/python-dataframe-replace-not-recognizing-strings
filtered["Grouping"] = filtered["Name"].replace(list(self.map_group["To Group"]), list(self.map_group["Group To"]))
print("a name" in list(self.map_group["To Group"]))
https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html
d1.groupby('Portfolio').size()
d1.groupby(['Portfolio', 'Type']).size()
https://stackoverflow.com/questions/28679930/how-to-drop-rows-from-pandas-data-frame-that-contains-a-particular-string-in-a-p
d1[d1.Type.str.contains('OPEX') == False].groupby(['Portfolio', 'Type']).size()
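The three groupby patterns above on an invented frame:

```python
import pandas as pd

d1 = pd.DataFrame({'Portfolio': ['Water Networks', 'Water Networks', 'Wastewater'],
                   'Type': ['CAPEX', 'OPEX', 'CAPEX']})

by_portfolio = d1.groupby('Portfolio').size()
by_both = d1.groupby(['Portfolio', 'Type']).size()
# filter out OPEX rows before grouping
no_opex = d1[~d1.Type.str.contains('OPEX')].groupby(['Portfolio', 'Type']).size()
print(by_portfolio['Water Networks'])  # 2
```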
Float precision IsNull mean¶
https://stackoverflow.com/questions/44912518/python-pandas-dataframe-converted-to-floats
round(average_petal_width, 2)
pd.set_option('display.precision', 2)
average_petal_width = iris_data.loc[iris_data['class'] == 'Iris-setosa', 'petal_width_cm'].mean()
iris_data.loc[(iris_data['class'] == 'Iris-setosa') & (iris_data['petal_width_cm'].isnull()), 'petal_width_cm'] = average_petal_width
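The same fill-NaN-with-the-class-mean pattern on a tiny invented frame:

```python
import numpy as np
import pandas as pd

iris_data = pd.DataFrame({'class': ['Iris-setosa'] * 3,
                          'petal_width_cm': [0.2, 0.4, np.nan]})

mask = iris_data['class'] == 'Iris-setosa'
# mean() skips NaN, so the missing value does not bias the average
average_petal_width = iris_data.loc[mask, 'petal_width_cm'].mean()
iris_data.loc[mask & iris_data['petal_width_cm'].isnull(),
              'petal_width_cm'] = average_petal_width
print(round(average_petal_width, 2))  # 0.3
```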
How to make good reproducible pandas examples¶
https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples
Joining dataframe ranges¶
https://stackoverflow.com/questions/44551864/joining-pandas-dataframe-on-hour-range
I have two Pandas dataframes, one with weather information, given hourly, and one with bus arrival data, given approximately every minute or so.
I want to join the dataframes so bus data is associated with weather for that hour e.g. bus data between 1pm and 2pm on 01/06/2012 will be associated with corresponding weather data.
pd.merge_asof(bus_df,weather_df, on='DateTime')
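A runnable sketch (weather and bus values invented; merge_asof requires both frames to be sorted on the key, and by default each left row takes the most recent right row at or before it):

```python
import pandas as pd

weather_df = pd.DataFrame({
    'DateTime': pd.to_datetime(['2012-06-01 13:00', '2012-06-01 14:00']),
    'Temp': [18.0, 21.0]})
bus_df = pd.DataFrame({
    'DateTime': pd.to_datetime(['2012-06-01 13:05', '2012-06-01 13:59',
                                '2012-06-01 14:30']),
    'Route': [7, 7, 12]})

# each bus row picks up the last weather reading at or before its timestamp
merged = pd.merge_asof(bus_df, weather_df, on='DateTime')
print(merged['Temp'].tolist())  # [18.0, 18.0, 21.0]
```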
Running Total¶
https://stackoverflow.com/questions/45122444/pandas-row-addition
You can use cumsum and then add the base number. (df[::-1].Total - df[::-1].Amount.cumsum()).max() gives the total prior to the last row; that base can then be added to the cumsum results. Also, as the data is added from the bottom up, the rows must first be reversed with df[::-1].
df['Total_New']=df[::-1].Amount.cumsum()+(df[::-1].Total-df[::-1].Amount.cumsum()).max()
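A worked sketch of the reversed-cumsum trick (numbers invented; the bottom row is the only one with a known Total):

```python
import numpy as np
import pandas as pd

# Newest rows at the top; 'Total' is only known for the oldest (bottom) row
df = pd.DataFrame({'Amount': [5, 10, 20],
                   'Total': [np.nan, np.nan, 100]})

# base = total before the oldest row: 100 - 20 = 80 (NaN rows are skipped by max)
base = (df[::-1].Total - df[::-1].Amount.cumsum()).max()
df['Total_New'] = df[::-1].Amount.cumsum() + base
print(df['Total_New'].tolist())  # [115.0, 110.0, 100.0]
```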
Drop_Row_with_conditions¶
From http://stackoverflow.com/questions/13851535/how-to-delete-rows-from-a-pandas-dataframe-based-on-a-conditional-expression
To directly answer the question's title (which is not necessarily the OP's problem, but may help other users coming across this question), one way to do this is to use the drop method:
df = df.drop(some labels)
df = df.drop(df[<some boolean condition>].index)
Example¶
To remove all rows where column 'score' is < 50:¶
df = df.drop(df[df.score < 50].index)
In place version (as pointed out in comments)¶
df.drop(df[df.score < 50].index, inplace=True)
Multiple conditions¶
(see Boolean Indexing)
The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses.
To remove all rows where column 'score' is < 50 and > 20¶
df = df.drop(df[(df.score < 50) & (df.score > 20)].index)
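A runnable sketch of the conditional drop (scores invented):

```python
import pandas as pd

df = pd.DataFrame({'score': [10, 30, 45, 80]})

# Drop rows where 20 < score < 50
df = df.drop(df[(df.score < 50) & (df.score > 20)].index)
print(df.score.tolist())  # [10, 80]
```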
Create_Dataframe_From_List_of_Dicts¶
pd.DataFrame(d)
Example:
import glob2, os
import pandas as pd

def process(file_name):
    pp = {}  # must be a dict, not a list
    pp['filename'] = file_name
    pp['last_modified'] = os.path.getmtime(file_name)
    return pp

listing = [process(x) for x in glob2.iglob('*.md')]
df = pd.DataFrame(listing)
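The same pattern without touching the filesystem (file names invented): each dict becomes one row, and the keys become the column names.

```python
import pandas as pd

listing = [{'filename': 'a.md', 'last_modified': 1.0},
           {'filename': 'b.md', 'last_modified': 2.0}]
df = pd.DataFrame(listing)
print(df.shape)  # (2, 2)
```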